
Overview

We’ll begin by reviewing concepts and labs from last week.

Then we’ll start talking about strategies for making causal claims in observational studies.

  • Not everything we want to make causal claims about is amenable to random assignment
  • But, the conditions created by an experiment provide a benchmark to assess the credibility of a study’s causal claims
  • All observational studies that make causal claims make some assumption that the thing they’re interested in is as-if randomly assigned
  • Specific features of a study’s design make this assumption more or less credible.
    • Natural experiment: Nature randomized the thing I care about (think of lotteries to get into select schools) or created some exogenous source of variation
    • A kitchen sink regression: I’ve controlled for all the relevant factors that matter. How do I know? Because I said so.

After today, you should be able to explain:

  • The basic intuition behind the concept of conditional ignorability
  • What it means to control for some variable
  • The idea of confounding variables
  • The following tools for making causal claims with observational data:
    • Regression
    • Difference-in-difference
    • Regression discontinuity
    • Instrumental variables

Review

Conceptual review

The key conceptual points from last week are:

  • Causation involves making claims about counterfactuals
  • The fundamental problem of causal inference (FPoCI) is that for any individual, we can only see one of potentially many possible outcomes
  • Randomization is the key to estimating average causal effects.
  • In experiments, researchers intervene on the world by randomly assigning subjects to treatment and control
  • Random assignment addresses concerns about selection bias
  • It creates two groups that, on average, are as similar as possible but for the thing whose causal effect we wish to know.

Lab Review

  • We’ll review the results in 02_lab_2_comments.Rmd

Internal vs External Validity

  • Experiments have high internal validity
    • Our study provides a valid (unbiased) estimate of the thing we’re trying to learn
    • External validity is a more mushy concept
      • Generalizability
      • Ecological validity
    • Field experiments are thought to be more “externally valid” than lab or survey experiments
  • In any study, we can think about relative tradeoffs between internal and external validity

Credible Designs

  • In this class, we will evaluate the credibility of studies by the features of their design
  • What makes B&K credible?
    • Randomization with a placebo control
    • Multiple measures and periods of analysis
    • Second intervention
  • What makes B&K less credible?
    • Measurement (Attitudes vs Behavior)
    • Attrition
    • Questions about time, place, context, that may limit the generalizability of their results
  • What’s a meaningful difference?
    • What’s a confidence interval?

Causal Inference in Observational Studies

Observational versus experimental studies

  • In an observational study the researcher does not control the treatment assignment
  • No guarantee that the treatment (D=1) and control (D=0) groups are comparable (that is, that we’re comparing like with like)
  • Instead, we have to justify our claims by theory and assumption rather than by direct manipulation

(P)Review: Conditional ignorability

If treatment is not randomly assigned then:

\[ Y_i(1),Y_i(0) \not\perp D_i \]

However, in some situations, it may be plausible to claim that conditional on some variable(s) \(X\), the distribution of potential outcomes \(Y\) is the same (independent) across levels of treatment \(D\) (conditional ignorability)

\[ Y_i(1),Y_i(0) \perp D_i |X_i \]

  • Conditional on some covariate(s) \(X_i\), our treatment is as-if randomized

As-if randomized

  • The claim that treatment is as-if randomized requires further justification in the theory and design of your study

\[ Y_i(1),Y_i(0) \perp D_i |X_i \]

  • While we can’t “prove” this assumption, we typically can test some observable implications of this claim, specifically, things like covariate balance

Covariate balance

Recall that true randomization implies ignorability/independence between treatment and potential outcomes

\[ Y_i(1),Y_i(0) \perp D_i \]

Such that not only does

\[ E[Y(1)|D=1]=E[Y(1)]=E[Y(1)|D=0] \]

But also

\[ E[X_i|D=1]=E[X_i]=E[X_i|D=0] \]

In words, the distribution of pre-treatment (why?) covariates (both observed and unobserved) should be similar across treatment and control groups.

While we can’t prove our assumption of as-if randomization:

\[ Y_i(1),Y_i(0) \perp D_i |X \]

We can test its observable implications.
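As a sketch of what such a test looks like in practice, here is a small simulation (all names and numbers are made up, not from any real study): when treatment probability depends on a covariate, the imbalance shows up as a difference in conditional means, which t.test() can assess.

```r
# Sketch: testing covariate balance on simulated data (all names hypothetical)
set.seed(42)
n <- 1000
age <- rnorm(n, mean = 40, sd = 10)
# Treatment depends on age, so the groups are NOT balanced on age
d <- rbinom(n, 1, plogis((30 - age) / 10))

# Difference in mean age across treatment and control (treated are younger)
mean(age[d == 1]) - mean(age[d == 0])

# A t-test asks whether that difference could plausibly arise by chance
t.test(age ~ d)
```

Under true randomization, we'd expect this difference to be small and the test to be non-significant except by chance.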

Suppose we were interested in the effects of some job training program on future earnings (Y). Suppose younger people are more likely to have completed this program than older people.

Just looking at outcomes between participants and non-participants would conflate the effects of the program with the effects of age (and the things correlated with age, like education).

So we might expect that levels of education between these groups would vary.

\[ E[Education|D=1]\neq E[Education|D=0] \]

But if age were the only thing that distinguished participants from non-participants, then \(Y_i(1),Y_i(0) \perp D_i |X\) would hold, and we could estimate a conditional average treatment effect

\[ CATE= E[Y|D=1,Age]-E[Y|D=0,Age] \]

Further, to test our assumption that, conditional on age, job-training recipients are comparable to non-recipients, we could look at the conditional distributions of other covariates, like education. If our assumption holds, we would expect that

\[ E[Education|D=1,Age] = E[Education|D=0,Age] \]

Or that the difference between these means is small enough that it’s plausible to have arisen just by chance, making our claim that \(Y_i(1),Y_i(0) \perp D_i |X\) more credible.

We’d hope this type of equality

\[ E[Education|D=1,Age]=E[Education|D=0,Age] \]

Were true for all covariates. We can test it for observed covariates (things we call \(X\)), but may still worry about unobserved covariates (things we call \(U\)) like innate ability

\[ E[Ability|D=1,Age]\neq E[Ability|D=0,Age] \]

Great, but how do we do this in R?

You already have, by combining the mean() function with logical indexing:

# Load data from last lab
load("fulldata.rda")
# Subset data: drop rows with missing treatment assignment
df <- df[!is.na(df$treatment_group), ]

# Unconditional mean
mean(df$therm_trans_t1, na.rm = TRUE)
## [1] 56.7972
# Conditional mean for those who received the intervention
# E(therm_trans_t1 | treatment_group = "Trans-Equality")
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality"], na.rm = TRUE)
## [1] 60.18627
# Conditional means by treatment and age
# E(therm_trans_t1 | treatment_group = "Trans-Equality", age > 30)

# Olds
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality" & df$vf_age > 30], na.rm = TRUE)
## [1] 58.73529
mean(df$therm_trans_t1[df$treatment_group == "Recycling" & df$vf_age > 30], na.rm = TRUE)
## [1] 52.80791
# Youths
mean(df$therm_trans_t1[df$treatment_group == "Trans-Equality" & df$vf_age <= 30], na.rm = TRUE)
## [1] 67.44118
mean(df$therm_trans_t1[df$treatment_group == "Recycling" & df$vf_age <= 30], na.rm = TRUE)
## [1] 57.10417
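The conditional means above also give us conditional difference-in-means estimates; a quick arithmetic check using the values just printed:

```r
# Conditional differences in means (Trans-Equality minus Recycling),
# using the values printed above
effect_old   <- 58.73529 - 52.80791   # olds (age > 30): about 5.93
effect_young <- 67.44118 - 57.10417   # youths (age <= 30): about 10.34
```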

Conditioning and statistical control in observational studies

Directed Acyclic Graphs

Directed acyclic graphs are a way of representing the causal relationships between variables

  • Paths: A route that connects variables (D to X to Y)
  • A path is causal if all of its arrows point in the same direction
  • “The effect of D on Y is entirely mediated by M”

Let’s consider a simple example examining the relationship between ice cream sales and crime

Common Causes: Ice cream and crime

  • When ice cream sales go up so does crime
  • We all know sherbet is terrible, but does ice cream really cause crime?

  • Probably not. Both crime and ice cream sales have a common cause: Summer (hotter weather, more people outside, the mechanism’s murky)
  • In general, two variables with a common cause will have a marginal relationship: \(Pr[Y=y|D=1]\neq Pr[Y=y|D=0]\)
  • Correlation \(\neq\) Causation

Confounding Relationships

  • If X is a common cause of D and Y, X is often called a confounding variable, because the true causal effect of D on Y is confounded by each variable’s relationship with X.
  • What do we do?

  • If we can condition on X, we can identify the effect of D on Y

  • Great! So how do we condition on variables?
  • One approach is to hold the variable we want to condition on (X) constant at some value (c) or range of values, and look at the relationship within those levels (e.g. E[Y|D=1,X=c]-E[Y|D=0,X=c])
    • Only look at the relationship between ice cream sales and crime on days with the same temperature
    • In our lab, only look at villages within some distance of the border.
  • As we’ll discuss next class, this process of subclassification has some drawbacks and so we’ll discuss alternative strategies and approaches.
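A small simulation makes the common-cause logic concrete. The setup below is entirely made up (variable names and coefficients are hypothetical): temperature drives both ice cream sales and crime, with no causal arrow between them.

```r
# Sketch: a simulated common cause (all names and coefficients hypothetical)
set.seed(1)
n <- 500
temp      <- rnorm(n, 25, 5)        # summer temperature (the common cause)
ice_cream <- 2 * temp + rnorm(n)    # sales driven by temperature
crime     <- 3 * temp + rnorm(n)    # crime driven by temperature;
                                    # note: no arrow ice_cream -> crime

# Marginal relationship: strongly correlated despite no causal link
cor(ice_cream, crime)

# Conditioning on the common cause: the relationship (nearly) disappears
coef(lm(crime ~ ice_cream + temp))["ice_cream"]
```

The marginal correlation is large, while the coefficient on ice cream after controlling for temperature is close to zero — correlation, not causation.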

Conditioning on covariates

The GQ approach

Subclassification

  • Our goal is to approximate the experimental ideal of covariate balance so that we’re comparing like with like.
  • One approach is to subset the data by groups (or strata) and then look at the relationship between D and Y within groups (e.g. E(Y|D=1,X=x))
  • Pros:
    • Conceptually simple
  • Cons:
    • What if your variable is continuous?
    • What if you have lots of potential confounders?
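A sketch of the subclassification estimator on simulated data (the stratum, treatment probabilities, and effect size are all hypothetical): compute the difference in means within each stratum, then average those differences weighted by stratum size.

```r
# Sketch: subclassification on a binary stratum (simulated, names hypothetical)
set.seed(7)
n <- 2000
young <- rbinom(n, 1, 0.5)                       # stratum: young vs old
d <- rbinom(n, 1, ifelse(young == 1, 0.7, 0.3))  # young more likely treated
y <- 2 * d + 5 * young + rnorm(n)                # true treatment effect = 2

# Naive difference in means is confounded by the stratum
naive <- mean(y[d == 1]) - mean(y[d == 0])

# Within-stratum differences in means, weighted by stratum share
eff <- sapply(c(0, 1), function(s)
  mean(y[d == 1 & young == s]) - mean(y[d == 0 & young == s]))
w <- c(mean(young == 0), mean(young == 1))
strat <- sum(eff * w)   # close to the true effect of 2
```

The naive comparison is badly biased upward, while the stratified estimate is close to the true effect — but only because the single confounder is observed and discrete.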

Covariate adjustment

  • Parametric/model-based approaches
    • Regression
  • Pros: Easy to implement and interpret
  • Cons: Assumes you have the correct statistical model…
  • Semi/less parametric approaches
    • Matching
      • Exact (similar)
      • Coarsened (“bin” the data)
      • Propensity score (summarize the data with a propensity score model)
      • Distance-based approaches (summarize the data using a distance measure)
  • Pros: More flexible
  • Cons: Model dependent, only guarantees balance on observed covariates
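A minimal sketch of the regression approach on simulated data (names and coefficients hypothetical): including the confounder as a regressor recovers the treatment effect, provided the linear model is right.

```r
# Sketch: covariate adjustment via regression (simulated, names hypothetical)
set.seed(3)
n <- 1000
x <- rnorm(n)                    # observed confounder
d <- rbinom(n, 1, plogis(x))     # treatment depends on x
y <- 1.5 * d + 2 * x + rnorm(n)  # true effect of d is 1.5

coef(lm(y ~ d))["d"]       # unadjusted: biased upward by x
coef(lm(y ~ d + x))["d"]   # adjusted: close to 1.5, IF the model is right
```

If the outcome depended on x nonlinearly, this same linear adjustment could still be biased — that's the model-dependence concern above.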

Observed and unobserved confounders

What do we mean when we say something is model dependent?

  • Here we’ve conditioned on X, but have no guarantee that some lurking \(U\) might also confound our estimates

Strategies for making causal claims in observational studies

A Typology of Observational Studies

  • Cross-sectional
  • Longitudinal
  • Natural Experiments

Cross-sectional

  • Observational data at single point in time (cross-section)
  • Methods: regression, matching
  • Concerns: Selection on observables, omitted variable bias

Conditioning on (Controlling for) a Variable

Longitudinal

  • Observational data over multiple points in time
  • Methods: fancy regression models (fixed and random effects, VAR, …), Difference-in-Difference designs
  • Concerns: Possible to control for time invariant covariates, but what about unobserved, time-varying covariates?

Difference in Difference Designs
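The basic difference-in-differences calculation, on made-up group means (all numbers hypothetical): subtract each group's pre-period mean from its post-period mean, then difference those changes.

```r
# Sketch: difference-in-differences on hypothetical group means
pre_treated  <- 10; post_treated <- 16
pre_control  <- 9;  post_control <- 12

# Each group's change over time
change_treated <- post_treated - pre_treated   # 6
change_control <- post_control - pre_control   # 3

# DiD: the treated group's change minus the control group's change
did <- change_treated - change_control         # 3
```

The control group's change stands in for what would have happened to the treated group absent treatment — the "parallel trends" assumption.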

True natural experiments

  • Nature cooperates to create a situation in which treatment is randomly assigned
  • Methods: Regression, non-parametric estimators
  • Concerns: Nature is rarely so kind; the researcher doesn’t control assignment, so are we sure it’s really random?

Discontinuities

  • Nature creates some discontinuity (like the line between Occupied and Vichy France), such that observations close to that discontinuity are plausibly similar, except for the fact that one group receives treatment and the other does not
  • Concerns: Estimating a “local” ATE (i.e. true for cases close to the discontinuity); Is local as-if randomization a reasonable assumption?
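A sketch of the local comparison on simulated data (the cutoff, bandwidth, and effect size are all hypothetical): compare mean outcomes just above and just below the discontinuity.

```r
# Sketch: regression discontinuity on simulated data (setup hypothetical)
set.seed(5)
n <- 5000
running <- runif(n, -1, 1)                 # e.g. distance from the border
d <- as.numeric(running >= 0)              # treatment switches at the cutoff
y <- 2 * d + running + rnorm(n, sd = 0.5)  # true local effect = 2

# Compare means in a narrow window around the cutoff (a "local" ATE)
h <- 0.1   # bandwidth defining "close" to the discontinuity
rd_est <- mean(y[running >= 0 & running < h]) -
          mean(y[running > -h & running < 0])
```

Narrower bandwidths make the as-if-random claim more plausible but use fewer observations, and the estimate only speaks to cases near the cutoff.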

Instrumental Variables

  • Z is a potential instrument for the effect of D on Y
  • Exclusion restriction: no common cause of Z and Y; the only path from Z to Y is through D
  • Very hard to find “good” instruments
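A sketch of the simplest IV estimator (the Wald estimator) with a binary instrument, on simulated data (the setup is hypothetical): an unobserved U confounds D and Y, but Z affects Y only through D.

```r
# Sketch: Wald (IV) estimator on simulated data (setup hypothetical)
set.seed(9)
n <- 10000
u <- rnorm(n)                     # unobserved confounder of d and y
z <- rbinom(n, 1, 0.5)            # instrument: independent of u
d <- rbinom(n, 1, plogis(z + u))  # treatment: driven by z AND u
y <- 2 * d + u + rnorm(n)         # true effect of d is 2

# Naive comparison is confounded by u
naive <- mean(y[d == 1]) - mean(y[d == 0])

# Wald estimator: effect of z on y, scaled by effect of z on d
itt         <- mean(y[z == 1]) - mean(y[z == 0])  # "reduced form"
first_stage <- mean(d[z == 1]) - mean(d[z == 0])  # "first stage"
wald <- itt / first_stage                         # close to 2
```

The naive estimate is biased by U, while the Wald ratio recovers the effect — but only because Z here satisfies the exclusion restriction by construction.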